6 research outputs found
Look and Modify: Modification Networks for Image Captioning
Attention-based neural encoder-decoder frameworks have been widely used for
image captioning. Many of these frameworks focus entirely on generating the
caption from scratch, relying solely on the image features or on regional
features from an object detector. In this paper, we introduce a novel framework
that learns to modify existing captions produced by a given framework by
modeling the residual information: at each timestep, the model learns what to
keep, remove, or add to the existing caption, allowing it to focus fully on
"what to modify" rather than on "what to predict". We evaluate our method on
the COCO dataset, trained on top of several image captioning frameworks, and
show that our model successfully modifies captions, yielding better captions
with higher evaluation scores.
Comment: Published in BMVC 2019
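A minimal sketch of the "keep vs. modify" idea described above, assuming a decoder hidden state and the embedding of the existing caption's word at the same timestep; the module and variable names are illustrative and not the paper's released code:

```python
import torch
import torch.nn as nn

class ResidualGate(nn.Module):
    """Hypothetical gate deciding how much of the existing caption to keep."""
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h_t, e_existing):
        # h_t: decoder state at timestep t; e_existing: embedding of the
        # existing caption's token at the same timestep (both [batch, hidden]).
        g = torch.sigmoid(self.gate(torch.cat([h_t, e_existing], dim=-1)))
        mixed = g * e_existing + (1.0 - g) * h_t   # keep vs. modify trade-off
        return self.out(mixed)                      # logits over the vocabulary
```

The point of the sketch is that the model only has to produce the residual correction, rather than predicting every word from scratch.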
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Natural Language Explanations (NLE) aim at supplementing the prediction of a
model with human-friendly natural text. Existing NLE approaches involve
training separate models for each downstream task. In this work, we propose
Uni-NLX, a unified framework that consolidates all NLE tasks into a single and
compact multi-task model using a unified training objective of text generation.
Additionally, we introduce two new NLE datasets: 1) ImageNetX, a dataset of
144K samples for explaining ImageNet categories, and 2) VQA-ParaX, a dataset of
123K samples for explaining the task of Visual Question Answering (VQA). Both
datasets are derived by leveraging large language models (LLMs). By training on
the 1M combined NLE samples, our single unified framework is capable of
simultaneously performing seven NLE tasks, including VQA, visual recognition,
and visual reasoning, with 7x fewer parameters, demonstrating performance
comparable to the independent task-specific models of previous approaches and
even outperforming them on certain tasks. Code is at
https://github.com/fawazsammani/uni-nlx
Comment: Accepted to ICCVW 2023
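One way to read "a unified training objective of text generation" is that every NLE task is cast as the same source-to-target generation problem, distinguished only by how the input is formatted. The sketch below is an assumption about such a formatting step (the prefix strings and field names are hypothetical, not Uni-NLX's exact format):

```python
def format_nle_sample(task, task_input, answer, explanation):
    """Cast any NLE task as plain text generation (illustrative only)."""
    # e.g. task = "vqa", task_input = "What is the man holding?"
    source = f"{task}: {task_input}"
    target = f"{answer} because {explanation}"
    return source, target

# A single cross-entropy loss over the target tokens can then be shared by all
# seven tasks, which is what allows one compact model to replace seven.
```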
Show, Edit and Tell: A Framework for Editing Image Captions
Most image captioning frameworks generate captions directly from images,
learning a mapping from visual features to natural language. However, editing
existing captions can be easier than generating new ones from scratch.
Intuitively, when editing captions, a model is not required to learn
information that is already present in the caption (i.e. sentence structure),
enabling it to focus on fixing details (e.g. replacing repetitive words). This
paper proposes a novel approach to image captioning based on iterative adaptive
refinement of an existing caption. Specifically, our caption-editing model
consists of two sub-modules: (1) EditNet, a language module with an adaptive
copy mechanism (Copy-LSTM) and a Selective Copy Memory Attention mechanism
(SCMA), and (2) DCNet, an LSTM-based denoising auto-encoder. These components
enable our model to directly copy from and modify existing captions.
Experiments demonstrate that our new approach achieves state-of-the-art
performance on the MS COCO dataset both with and without sequence-level training.
Comment: Accepted to CVPR 2020
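A minimal sketch of an adaptive copy mechanism of the kind described above, assuming the existing caption has already been attended over; the names and shapes are assumptions and this is not the released EditNet/DCNet code:

```python
import torch
import torch.nn as nn

class CopyGate(nn.Module):
    """Hypothetical gate mixing copying from the existing caption with generation."""
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.p_copy = nn.Linear(hidden_dim, 1)
        self.generate = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h_t, existing_ids, attn_weights):
        # h_t: [B, H] decoder state; existing_ids: [B, L] token ids of the
        # caption being edited; attn_weights: [B, L] attention over its words.
        p = torch.sigmoid(self.p_copy(h_t))                   # probability of copying
        gen_dist = torch.softmax(self.generate(h_t), dim=-1)  # [B, V]
        copy_dist = torch.zeros_like(gen_dist)
        copy_dist.scatter_add_(1, existing_ids, attn_weights) # project attention onto vocab
        return p * copy_dist + (1.0 - p) * gen_dist           # mixed word distribution
```

The design intuition matches the abstract: words already correct in the caption can be copied cheaply, so the model's capacity goes into fixing details.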
Visualizing and Understanding Contrastive Learning
Contrastive learning has revolutionized the field of computer vision,
learning rich representations from unlabeled data, which generalize well to
diverse vision tasks. Consequently, it has become increasingly important to
explain these approaches and understand their inner workings. Given
that contrastive models are trained with interdependent and interacting inputs
and aim to learn invariance through data augmentation, the existing methods for
explaining single-image systems (e.g., image classification models) are
inadequate as they fail to account for these factors. Additionally, there is a
lack of evaluation metrics designed to assess pairs of explanations, and no
analytical studies have been conducted to investigate the effectiveness of
different techniques used to explain contrastive learning. In this work, we
design visual explanation methods that contribute towards understanding
similarity learning tasks from pairs of images. We further adapt existing
metrics, used to evaluate visual explanations of image classification systems,
to suit pairs of explanations and evaluate our proposed methods with these
metrics. Finally, we present a thorough analysis of visual explainability
methods for contrastive learning, establish their correlation with downstream
tasks and demonstrate the potential of our approaches to investigate their
merits and drawbacks.
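As a concrete illustration of explaining a *pair* of inputs rather than a single image, one simple gradient-based baseline is to back-propagate the embedding similarity to the pixels of both images. This is a generic sketch under that assumption, not necessarily one of the paper's proposed methods:

```python
import torch
import torch.nn.functional as F

def pairwise_saliency(encoder, img_a, img_b):
    """Saliency maps for a pair of images w.r.t. their embedding similarity."""
    img_a = img_a.clone().requires_grad_(True)
    img_b = img_b.clone().requires_grad_(True)
    z_a, z_b = encoder(img_a), encoder(img_b)          # [B, D] embeddings
    sim = F.cosine_similarity(z_a, z_b, dim=-1).sum()  # pairwise similarity score
    sim.backward()
    # Per-pixel importance for each image in the pair.
    return img_a.grad.abs().sum(dim=1), img_b.grad.abs().sum(dim=1)
```

Evaluating such pairs of maps is exactly where the adapted metrics discussed above come in, since single-image metrics do not account for the interaction between the two inputs.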
Deep convolutional networks for magnification of DICOM Brain Images
Convolutional neural networks have recently achieved great success in Single Image Super-Resolution (SISR). SISR is the task of reconstructing a high-quality image from a low-resolution one. In this paper, we propose a deep Convolutional Neural Network (CNN) for the enhancement of Digital Imaging and Communications in Medicine (DICOM) brain images. The network learns an end-to-end mapping between the low- and high-resolution images. We first extract features from the image, where each new layer is connected to all previous layers. We then adopt residual learning and a mixture of convolutions to reconstruct the image. Our network is designed to work with grayscale images, since brain images are originally in grayscale. We further compare our method with previous works, trained on the same brain images, and show that our method outperforms them.
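A minimal sketch of the two ingredients named above, dense feature reuse (each layer connected to all previous layers) and a global residual connection, for single-channel images. Layer sizes are assumptions and the input is assumed to be pre-upsampled (e.g., bicubically) to the target resolution; this is not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DenseResidualSR(nn.Module):
    """Illustrative densely connected CNN with residual reconstruction."""
    def __init__(self, channels=64, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = 1                                  # grayscale brain images
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += channels                      # each layer sees all previous features
        self.reconstruct = nn.Conv2d(in_ch, 1, 3, padding=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        # Residual learning: predict the high-frequency detail and add it back.
        return x + self.reconstruct(torch.cat(feats, dim=1))
```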
EEG Signal Analysis of Stroke Patients with Applications of Deep Learning